Week 5 of 12 · Part A — Applied Safety

The Attack-Surface Map

Locking in Week 5 — one defensible page that says where a system can be hit and what holds

Day 25 · ~50 minutes · Review

Day 25 of 60

What you now hold

Five days ago, robustness was probably a word that meant "the model is good." Now it's a discipline. You can explain why jailbreaks work (competing objectives, mismatched generalization), why prompt injection is an architectural problem and not a prompting one, why agents turn content vulnerabilities into action vulnerabilities, and, with the Go result, why capability does not imply robustness. And you've built the artifact that ties it together: a coverage matrix that shows which layer stops which attack and where the gaps are.
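As a rough sketch of what such a coverage matrix could look like in code, the Python below uses hypothetical layer names, attack names, and True/False entries. None of these values come from your Day 23 matrix; real entries should come from measurement, not guesswork.

```python
# Hypothetical coverage matrix: rows are attacks, columns are defensive layers.
# True means "this layer is expected to stop this attack". Every entry here is
# an illustrative placeholder, not a measured result.
LAYERS = ["input_filter", "safety_tuning", "output_filter", "least_privilege"]

COVERAGE = {
    "direct_jailbreak": {
        "input_filter": True, "safety_tuning": True,
        "output_filter": True, "least_privilege": False,
    },
    "indirect_injection": {
        "input_filter": False, "safety_tuning": False,
        "output_filter": False, "least_privilege": True,
    },
    "multimodal_evasion": {
        "input_filter": False, "safety_tuning": False,
        "output_filter": False, "least_privilege": False,
    },
}


def layers_stopping(attack):
    """Return the layers marked as stopping the given attack."""
    return [layer for layer in LAYERS if COVERAGE[attack][layer]]


def gaps():
    """Return the attacks that no layer is marked as stopping."""
    return [attack for attack in COVERAGE if not layers_stopping(attack)]


if __name__ == "__main__":
    for attack in COVERAGE:
        print(attack, "->", layers_stopping(attack) or "NO COVERAGE")
    print("gaps:", gaps())
```

Reading the matrix by row answers "what stops this attack?"; an empty row is a gap you have to name explicitly.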

The through-line of Week 5

A capable model is a soft layer that will sometimes be fooled. Robustness is what you build around it: layered, measured, monitored defenses, with the dangerous capabilities gated so that when a trick works, it still can't do anything catastrophic. You don't hope the model isn't jailbreakable — you assume it is, and engineer for that.

The attack-surface map — your synthesis artifact

This week's capstone is a single page you could hand to a team before shipping a tool-using agent: an attack-surface map. It names each way in, the defenses on that path, and the honest residual gap. Building it forces every concept from the week into one place.

The Map

1 · Entry points

Where untrusted input reaches the model: the user's own messages (direct jailbreak), retrieved content (indirect injection — web, email, files, tool output), and non-text channels (multimodal evasion). One row per entry point.

2 · Defenses per path

For each entry point, which of your layers apply — input filter, system prompt, safety tuning, output filter, provenance, least privilege, human-in-the-loop, action monitoring. Pull this straight from your Day 23 matrix.

3 · Residual gap & where guarantees are possible

Name the path with the thinnest coverage, and be honest about where you have real guarantees versus best-effort. Some properties can be formally verified or hard-gated (an agent that was never granted a permission cannot use it); most model-level defenses are statistical and bypassable. Mark which is which: that honesty is the artifact's value.
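The three parts of the map can be sketched as a small data structure. This is one possible shape, not a prescribed format: the class names (`Defense`, `EntryPoint`), the `hard_guarantee` flag, the example rows, and the "thinnest path" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Defense:
    name: str
    # True: enforced outside the model (hard gate or verified property).
    # False: statistical, best-effort, bypassable.
    hard_guarantee: bool


@dataclass
class EntryPoint:
    name: str
    defenses: list = field(default_factory=list)

    def hard(self):
        return [d.name for d in self.defenses if d.hard_guarantee]

    def best_effort(self):
        return [d.name for d in self.defenses if not d.hard_guarantee]


def thinnest(entry_points):
    """Pick the path with the fewest hard guarantees, then the fewest defenses.

    This ordering is an assumed heuristic; a real review would also weigh
    how damaging a successful attack on each path could be.
    """
    return min(entry_points, key=lambda e: (len(e.hard()), len(e.defenses)))


# Illustrative rows only; which defense counts as "hard" depends on the deployment.
ATTACK_SURFACE = [
    EntryPoint("user_messages", [
        Defense("input_filter", False),
        Defense("safety_tuning", False),
        Defense("output_filter", False),
    ]),
    EntryPoint("retrieved_content", [
        Defense("provenance", False),
        Defense("least_privilege", True),
    ]),
    EntryPoint("non_text_channels", [
        Defense("safety_tuning", False),
    ]),
]

if __name__ == "__main__":
    for entry in ATTACK_SURFACE:
        print(f"{entry.name}: hard={entry.hard()} best_effort={entry.best_effort()}")
    print("thinnest path:", thinnest(ATTACK_SURFACE).name)
```

Keeping the hard/best-effort flag on every defense makes the "mark which is which" step something a reviewer can check row by row.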

Where verification helps — and where it doesn't

Formal verification and hard permission boundaries give real guarantees about what the system is allowed to do, and those guarantees hold even against attacks no one imagined, provided the boundary itself is implemented correctly. They cannot, today, guarantee what a language model will say on every input. So the durable defenses are the architectural ones: gate the capabilities, and the model's softness stops being catastrophic. That's the single most important sentence to carry out of Part A's safety-engineering arc.
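A minimal sketch of capability gating, assuming a hypothetical `GatedToolbox` wrapper and two toy tools (`read_file`, `send_email`) invented for this example. The point it illustrates: the gate is ordinary code outside the model, so no prompt can talk its way past it.

```python
class PermissionDenied(Exception):
    """Raised when an agent requests a tool it was never granted."""


class GatedToolbox:
    def __init__(self, tools, granted):
        # Keep only the granted tools; ungranted ones are not stored at all,
        # so nothing the model outputs can reach them through this object.
        self._tools = {name: fn for name, fn in tools.items() if name in granted}

    def call(self, name, *args, **kwargs):
        if name not in self._tools:
            raise PermissionDenied(f"tool {name!r} is not granted to this agent")
        return self._tools[name](*args, **kwargs)


# Toy tools standing in for real capabilities (hypothetical).
def read_file(path):
    return f"contents of {path}"


def send_email(to, body):
    return f"sent to {to}"


# This agent may read files but was never given the ability to send email.
toolbox = GatedToolbox(
    {"read_file": read_file, "send_email": send_email},
    granted={"read_file"},
)

if __name__ == "__main__":
    print(toolbox.call("read_file", "notes.txt"))
    try:
        # Even a fully jailbroken model asking for this gets refused here.
        toolbox.call("send_email", "attacker@example.com", "secrets")
    except PermissionDenied as err:
        print("blocked:", err)
```

An in-process wrapper like this only shows the idea. In a real deployment the boundary would more plausibly live in credentials, API scopes, or OS-level permissions, so that the agent's process never holds the capability in the first place.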

Self-quiz — can you do these without notes?

Prove the Week

~50 minutes

Explain why safety tuning is brittle using both competing objectives and mismatched generalization — see Jailbroken if you stall.
Define direct vs indirect prompt injection and state why agents raise the stakes — ground it in Indirect Prompt Injection.
Name the four defensive layers from your coverage matrix and, for each, one attack it catches and one it misses.
Tell the Adversarial Policies (Go) story from memory and say why it proves capability ≠ robustness.
Finalize your attack-surface map for one agent deployment, and write your Week 5 summary plus the one robustness question you most want answered later.

The expert move

A practitioner lists the attacks they know. An expert hands you a map: every entry point, the layered defenses on it, the honest residual gap, and a clear line between what's guaranteed and what's best-effort. The altitude jump is from cataloguing threats to governing them — owning the one page that decides whether a system is ready to ship, and being able to defend every cell of it in a review.

Say this in an interview: "Before shipping an agent I build an attack-surface map: entry points, defenses per path, residual gaps, and a clear line between hard guarantees and statistical best-effort. I assume the model can be jailbroken, so I gate the dangerous capabilities — that way a successful trick still can't cause a catastrophe. Robustness is the stack, never the model alone."

Week 5 Takeaways

Jailbreaks come from competing objectives and mismatched generalization; injection is an architectural gap, not a prompting one.
Capability ≠ robustness — the Go result is your portable proof.
Robustness is the whole stack: measure per-attack coverage, gate dangerous capabilities, monitor actions.
Next: Part B — from observable behavior to the deeper alignment problem underneath it.